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FORWARD-REVERSE EM ALGORITHM FOR MARKOV CHAINS: 
CONVERGENCE AND NUMERICAL ANALYSIS 

CHRISTIAN BAYER, HILMAR MAI, AND JOHN SCHOENMAKERS 


Abstract. We develop a forward-reverse EM (FREM) algorithm for estimating parame¬ 
ters that determine the dynamics of a discrete time Markov chain evolving through a certain 
measurable state space. As a key tool for the construction of the FREM method we de¬ 
velop forward-reverse representations for Markov chains conditioned on a certain terminal 
state. These representations may be considered as an extension of the earlier work|Bayer| 
|and Schoenmakers||2013| on conditional diffusions. We proof almost sure convergence of 
our algorithm for a Markov chain model with curved exponential family structure. On the 
numerical side we give a complexity analysis of the forward-reverse algorithm by deriving 
its expected cost. Two application examples are discuss to demonstrate the scope of possi¬ 
ble applications ranging from models based on continuous time processes to discrete time 
Markov chain models. 


1. Introduction 


The EM algorithm going back to the seminal paper Laird and Rubin 1 1977) is a very 
general method for iterative computation of maximum likelihood estimates in the setting 
of incomplete data. The algorithm consists of an expectation step (E-step) followed by a 
maximization step (M-step) which led to the name EM algorithm. 

Due to its general applicability and relative simplicity it has nowadays found its way into 
a great number of applications. These include maximum likelihood estimates of hidden 
Markov models in MacDonald and Zucchini [ 19971, non-linear time series models in |Chan| 
and Ledolter 119951 and full information item factor models in Meng and Schilling 1 1996| 


to give just a very limited selection. 

Despite the simplicity of the basic idea of the algorithm its implementation in more 
complex models can be rather challenging. The global maximization of the likelihood in 


the M-step has recently been addressed successfully (see e.g. Meng and Rubin 119931 and 


Liu and Rubin 119941). On the other hand, when the expectation of the complete likeli¬ 


hood is not known in closed form only partial solutions have been given yet. One approach 


developed in Wei and Tanner 119901 uses Monte Carlo approximations of the unknown 
expectation and was therefore named Monte Carlo EM (MCEM) algorithm. As an alterna¬ 


tive procedure the stochastic approximation EM algorithm was suggested in Lavielle and 
[Moulines] fl999| . 
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In this paper we take a completely different route by using a forward-reverse algorithm 
(cf. Bayer and Schoenmakers 1 2013]) to approximate the conditional expectation of the 


complete data likelihood. In this respect we extend the idea from Bayer and Schoenmakers 


]2013| to a Markov chain setting, which is considered an interesting contribution on its 
own. Indeed, Markov chains are more general in a sense, since any diffusion monitored 
at discrete times yields canonically a Markov chain, but not every chain can be embedded 
(straightforwardly) into some continuous time diffusion the other way around. 

The central issue is the identification of a parametric Markov chain model ( X n , n — 
0,1,...) based on incomplete data, i.e. realizations of the model, given on a typically 
course grid of time points, let us say n\,ri 2 ,. ■ ■ n.y. Let us assume that the chain runs through 
and that the transition densities p 9 nm (x,y), n > m, of the chain exist (with p 9 nn (x,y) 
d A (y)), where the unknown parameter 6 has to be determined. The log-likelihood function 
based on the incomplete observations (X, n ,, X„ N ) is then given by 

N -1 

(LI) 1(6, X„J,. . . , X„ N ) = ^ In Pni,n M ( x «i» x »/ +1 ), 

i=0 

with X„ 0 = xq being the initial state of the chain. Then the standard method of maximum 
likelihood estimation would suggest to evaluate 


IV—1 


( 1 . 2 ) 


arg max 1(6, X) = arg max V In p e (X n .,X nM ). 

a a • * 


1=0 


The problem in this approach lies in the fact that usually only the one-step transition den¬ 
sities p e nn+ \(x,y) are explicitly known, while any multi-step density p e nm (x,y) for m > n 
can be expressed as an m — n — 1 fold integral of one-step densities. In particular for larger 
m - n, these multiple integrals are numerically intractable however. 

In the EM approach, we therefore consider the alternative problem 

N -1 tl j +1 - 1 

(1.3) arg max ^ Bin p e jJ+l (Xj,X j+1 ), 

9 i=0 j=n, 

in terms of the “missing data” X IIj+ i, ...,X„. +1 _i, i = 0, ...,N— 1. As such, between two such 
consecutive time points, m and n ,-+1 say, the chain may be considered as a bridge process 
starting in realization X n . and ending up in realization X n . +I (under the unknown parameter 
6 though), and so each term in may be considered as an expected functional of the 
“bridged” Markov chain starting at time n, in (data point) X n ., conditional on reaching (data 
point) X n . +i at time « I+ i. 

We will therefore develop firstly an algorithm for estimating the terms in (1.31 for a 
given parameter 6. This algorithm will be of forward-reverse type in the spirit of the one 
in Bayer and Schoenmakers 120131 developed for diffusion bridges. It should be noted 
here that in the last years the problem of simulating diffusion bridges has attracted much 
attention. Without pretending to be complete, see for example, Bladt and S0rensen||2O14 


Delyon and Hu| 1 2006|, Milstein and Tretyakov 120041, Stinis 1 2011 1 , Stuart et al. |2004 


Schauer et al. [2013|. 


Having the forward-reverse algorithm at hand, we may construct an approximate solu¬ 
tion to in a sequential way by the EM algorithm: Once a generic approximation 6 m is 
constructed after m steps, one estimates 

N -1 ttf+i-1 


6m+ 1 := arg max ^ ^ E\n p e jJ+l (Xf', X^), 
9 ;=o j=m 
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where X Bm denotes the Markov bridge process under the transition law due to parameter 0 m 
and each term 


Elnp B J+l (X B '",X%) 

represents a forward-reverse approximation of 

E In P°j,j+ 1 (Xj m , X*}+ 1 ) 


as a (known) function of 8. 

Convergence properties of approximate EM algorithms have drawn considerable recent 
attention in the literature mainly driven by it’s success in earlier intractable estimation 
problems. An overview of existing convergence results for the MCEM algorithm can be 
fund in Neath | 2013) . Starting from a convergence result for the forward-reverse repre¬ 
sentation for Markov chains we prove almost sure convergence of the FREM sequence 


in the setting of curved exponential families based on techniques developed in Bayer and 


|Schoenmakers| |20 131 and |Fort and Mou lines 12003 ]. Essentially the only ingredient from 
the Markov chain model for this convergence to hold are exponential tails of the transition 
densities and their derivatives that are straightforward to check in many examples. 

Since computational complexity is always an issue in estimation techniques that involve 
a simulation step, we also include a complexity analysis for the forward-reverse algorithm. 
We show that the algorithm achieves an expected cost of the order 0(N log N ) for sample 
size of N forward-reverse trajectories. 


We mention a recent application paper by Bayer, Moraes, Tempone, and Vilanova Bayer 


et al. |2015| , which focuses on practical and implementation issues of the forward-reverse 


EM algorithm in the setting of Stochastic Reaction Networks (SRNs), i.e., of continuous 
time Markov chains with discrete, but generally infinite state space. 

The structure of the paper is as follows. In Section [2] we recapitulate and adapt the 
concept of reversed Markov chains, initially developed in Milstein et al. j 1 2007) using the 
ideas in Milstein et al. 120041 on reversed diffusions. A general stochastic representation 
— involving standard (unconditional) expectations only — for expected functionals of 
conditional Markov chains is constructed in Section [3] This representation allows for a 
forward reverse EM algorithm that is introduced and analyzed in Section [4] In Section [5] 
we proof almost sure convergence of the forward-reverse EM algorithm in the setting of 
curved exponential families. Implementation and Complexity of the FREM algorithm are 
addressed in Section [6] The paper is concluded with two application examples in Section 
[Tjthat demonstrate the wide scope of our method. 


2. Recap of forward and reverse representations for Markov chains 

Consider a discrete-time Markov process (X„, T r n ), n = 0,1,2,..., on a probability space 
(£2, J r , P) with phase space (S,S), henceforth called Markov chain. In general we assume 
that S is locally compact and that S is the Borel cr-algebra on S. For example, S =M. d or a 
proper subset of R rf . Let P„, n > 0, denote the one-step transition probabilities defined by 

(2.1) P n (x,B) :=P(X n+ i e B\X n = x), n = 0,1,2,..., xeS,BeS. 

In the case of an autonomous Markov chain all the one-step transition probabilities coincide 
and are equal to P := P 0 = P\ = ■ ■ ■. 

Let X"f, m > n, be a trajectory of the Markov chain which is at step n in the point x, 
i.e., X"' x = x. The multi-step transition probabilities P„ m are then defined by 

P„ im (x, B) := PPC* g B), xeS, B e S, m > n. 
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Due to these definitions, P n n {x,B) = 6 X (B) = 1 b(x) (Dirac measure), P„ = P „,„+\, and the 
Chapman - Kolmogorov equation has the following form: 

(2.2) P nm (x,B) = j' P n , k (x,dy)P k , m (y, B), x e S , B e S, n <k< m. 

Let us fix N > 0 and consider for 0 < n < N the function 


(2.3) 


U n (x) 


-! 


PnAx,dy)f(y) = Bf(X n ’ x ), 


where / is ^-measurable and such that the mathematical expectation in (2.31 exists; for 
example, / is bounded. By the Markov property we have for 0 < n < N : 

u n ix) = EfiX n - x ) = EfiX n f : 1 - X "' : ') 


r„ +1 „v» +1 -*£i) = E f(X n ; hX "h 


= E E r "*'f(X N 
= Bu n+l iX™) = J 

Thus, u„(x) satisfies the following discrete integral Cauchy problem 


u„ +l (y)P„(x,dy). 


(2.4) 

(2.5) 


Unix) 


-f 


u n+1 (y)P„ix,dy), n < N, 


UNix) = fix), 


and (2.3 1 is a forward probabilistic representation of its solution. In fact, the probabilistic 
representation ( |2.3[ > can be used for simulating the solution of ( |2.4[ i-( [23) i by Monte Carlo. 
For our purpose, reverse probabilistic representations we need a somewhat more general 
version of the above result. 


Theorem 2.1 (cf. Milstein et al. 12007 1). Let P„ be the one-step transition density of a 
Markov chain X as in (2.11 and let the function f : S —» R be measurable and bounded. 
Let further ip„ : S xS —> R be a measurable and bounded functions for n = 0,1,2,... Then, 
the solution of the problem 


( 2 . 6 ) 

(2.7) 


W„ix) 


-s 


W n+ iiz)iPnix,Z)Pnix,dz), H < N, 


W N ix) = fix) 
has the following probabilistic representation: 

(2.8) w„ix) = E[fiX n /)X n /’'], 

where (X, X) is an extended Markov chain in which X is governed by the equations 


y n,x.y _ yii.x.y (Y n,x Y n,x yti.x.y _ 

X k + 1 - X k w x k ’ X k+ 1 )’ x n -y. 


where n < k < N. 
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Proof. Note that X k ,x,y = yX n k ,x ' . Thus, for n < N, (2.8 ) may be written as 
w n (x) = E 


/*/ \ y n+ 1 ’ X n+l ^n+’l 


r ti+i,x n '* 


)X\ 


ti+\X£ v i 
N 


N > tx N 

= e[C'^.C,)] 

W n+ \(z)iPn(x,Z)Pn(x, dz), 

and \2.1) is trivially fulfilled for n — N. □ 

2.1. Reverse probabilistic representations. We henceforth take (S ,S) = (p.^, S(R d )) 
and assume that the transition probabilities P nm (x, dy) have densities p n . m (x,y) with respect 
to the Lebesgue measure on ( S,S ). We note however that without any problem one may 
consider more general state spaces equipped with some reference measure, and transition 
probabilities absolutely continuous to with respect to it. The representation (2.31 can thus 
be written in the form 


-/■ 


(2.9) 


1(f ) := E f(X‘ 


) x ) = J' P,i,n(x, 


y)f(y)dy, o <n<N. 


Let the initial value p of the chain X at moment n be random with density g(x). Consider 
the functional 


( 2 . 10 ) 


Kg J) 


ff 


g(x)p n . N (x, y)f(y)dxdy = E 


Formally, by taking for g a 5-function we obtain \2.9) again, and by taking / to be a 6- 
function we obtain the integral 


( 2 . 11 ) 


J(g) 




g(x)p„, N (x,y)dx. 


We now propose suitable (reverse) probabilistic representations for Jig), where g is an 
arbitrary test function (not necessarily a density). For this we are going to construct a class 
of reverse Markov chains that allow for a probabilistic representation for the solution of 

ra. 

Let us fix a number ff e N and consider for 0 < m < N, functions \J/ m : S x S —» R+ 
such that for each m and y the function 

(2.12) q m (y,-) := 0 < m < N, 

m(y, ■) 

is a density on S. For example, one could take t j/ m independent of the second argument, 
and then obviously 


(2.13) 


4>miy) 


f 


p N - m -i(z,y)dz. 


We now introduce a “reverse” processes (Y y m , JJ y m h< m <N by the system 

’ Y m+ 1 6 d A Y m=z) = q m (z,z')dz. 




Y } 0 := Y ( y := y, := tf* 1 := 1, 0 < m < N, 


/y ._ u 0 '-' 1 ’ 1 •_ 


(2.14) 
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hence Y y is governed by the one-step transition probabilities Q,„(z,dz') := q m {z,z')dz' (i.e. 
Q m instead of P m ). 


Theorem 2.2. For any n. 0 < n < N, ( 2.11 1 has the following probabilistic representation. 

g(x)p„, N (x,y)dx = E , 




where g is an arbitrary test function (a ’’density” p mm has to be interpreted as a Dirac 
distribution or 5-function). 


Proof From the Chapman - Kolmogorov equation (2.2 1 we obtain straightforwardly the 
Chapman-Kolmogorov equation for densities. 


(2.15) 


Pn.m(x,y) 


J' Pn.k(x, 


z)Pk,m(z„y)dz„ x,y e S, n < k < m. 


Let us now fix n, n < N (for n — N the statement is trivial) also, and introduce the functions 


(2.16) 


From (2.15 i we get 
(2.17) 


Vk(y) 


Vk(y) 


-f 


g(x)p lu k(x,y)dx, n < k < N. 


/' 


Vk-\{z)pk-\(z,y)dz, n <k < N, 

' ; HCy) = g(y), 

where pi, \ := Pk-\,k denote the one-step densities. For n < k < N we now consider a 
“reversed” time variable m = N + n-k and write with V m (y) := vjy +n - m (y) and (2.12 1 system 
( |2. 1 7[ > in the form 

(2.18) V, n (y) = Jv m+ \(z)f m - n (y,z)q m -n(y,z)dz, n < m < N, 

Lv(>’) = g(y)- 

Let us write ( 2.1 8| > in a slightly different form, 

v m (y) = Jv m+ i(z)if% ) (y,z)q% > (y,z)dz, n<m<N, 

Lv(y) = g(y) 


with ij/fi f m - n and qj" 1 := q m - n . Via Theorem 2.1 we next obtain a probabilistic repre¬ 
sentation of the form ( |2.8[i for the solution of problem ( 2.1 8| >, hence ( |2. 1 1 [ i or J(g). Indeed, 
by taking in Theorem 2.1 


instead of X a Markov chain ( Yl'^’ y ) , where Y (n ' ),y is gov- 

V m l n<m<N & 

erned by the one-step transition probabilities Qm\z,dz r ) := qm(z,z!)dz!, n < m < N, with 


initial condition Y^’ y = y, and constructing (j/%) 


according to 
^ n < m < N, 


<m<N 

(n),y 


(2.19) 

it follows by Theorem |2.1| that 

(2.20) J(g) = v n (y) = v N (y) = E [g^’ y )^ y ]. 
It remains to see that 
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which follows from the fact that initial values and the one step transition probabilities of 
the processes 


y»),v\ and fry yy\ 

V »+' ’ "+' )i= 0 . N-n V i ‘ / ,7V—« 


coincide. 


It should be stressed that, in contrast to a corresponding theorem in Milstein et al. 

1 2007) , Theorem 2.2 provides a family of probabilistic representations indexed by n = 
1,... ,1V, that involves only one common reverse process Y y . In Theorcm |2.2| ;V was fixed 
but, when different N are in play, we will denote them by K v:,v . It turns out that this exten¬ 
sion of the related result in Milstein et al.' 1 2007| is crucial for deriving probabilistic repre¬ 
sentations for conditional Markov chains below (cf. Bayer and Schoenmakers |20131). 


3. Simulation of conditional expectations via forward-reverse representations 


In this section we describe for a Markov Chain (2.1 1 an efficient procedure for estimat 


ing the final distributions of a chain X = {X n ) n =Q....,N conditioned, or pinned, on a terminal 
state X^. More specifically, for some given (unconditional) process X we aim at simulation 
of the functional 


(3.1) 


E [g(X mi ,..., X mr )\ X N = y, X 0 = x], 


where 0 < m\ < mi < • • • < m r < N (hence r < N), g is an arbitrarily given suitable 
test function, and x,y e are given states. The procedure proposed below is in fact an 
extension of the method developed in Bayer and Schoenmakers | 2013) t o discrete time 
Markov chains. We note that similar techniques as in Bayer and Schoenmakers 1 201 3| 
also allow us to treat the more general problem 


E , ■ • • , X,n r )| X N € A, X0 - x] 


for suitable sets A c 


3.1. Forward-reverse representations of conditional expectations. Let us consider the 
problem (3.1 1 for fixed x,y e R d (i.e. A = {y}). We firstly state the following central 
theorem. 

Theorem 3.1. Given a grid £)/ := {0 < n* < n\ < • ■ ■ < n/ =: N], it holds that 

IF T f( y- v; "' Y y,m Y y ’ n ' 't 1 

[7 V-* m—n o’ ± n/-n\ ’ • * * ’ J «/—«/_i7°'«/—no J 

= f(yo >7-1) FI Pn,_, ,n, (Vi- 1 , )’i)dyi -1 

J R dxL Li 

with yi := y and «o := n*. 


Proof. Without loss of generality, we assume in this proof that the grid satisfies n, - n,_i = 
1,1=1,...,/. Indeed, extend / : R dxl —* R to a function / : R such that 

f(y y ’ N Y r ’ N Y y ’ N Y y ’ N \ - f lv r,N Y t ' N Y y ' N ) 

J y 1 N-n*’ 1 N-n’-V " ’ 1 2 ,J 1 ) J m~no > 1 «;-«!>•••’ 1 m-ni-i) • 

Then, re-expressing the transition densities p„._ i n . in terms of the one-step transition densi¬ 
ties pi using Chapman-Kolmogorov, we see that the statement of the theorem is equivalent 
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to 


/ 

J R‘ 




(3.2) e[/(f^„ Yf_ n ,_ v ..., Yf)yf_ n ,\ 

f 0v> ■ ■ ■ ,y N -0 M P‘ \(yi-uyi)dy,-\ 

'-»*) , 1 , A 1 

with = y . In fact, we shall prove that 

C n 

(3.3) E \f p (Yf, Kf' v ) yf | = I f p (y N . p ,y N -,) f| p^(y^,yi)d yi - 

^ i=N-p +1 

for any 1 < p < N — n* for any (e.g., bounded measurable) function f p : R dXp —> R. 
gives the formula from the statement of the theorem for p = N — n* with /v „■ being the 
function / from above. We prove ( |3.3| ) by induction on p. For p — 1, this boils down to 
Theorem |Z2] with n = N - 1. 

For the step from p - 1 to p, we note that by definition 

with -)q p -\(y, •) = PN-{p-\)-i(-,y) = PN- P (-,y ) by \2.\2\ . Hence, we have 


with 


[Ad-r. r?:.. 

■ .,Yf)yf\=E\f p (Yf,Y^ v . 



= E tW--Ff 



\N 

i-l _ Z P- 1> 


v y;N 

■ ■. Y \ =z i 


g(z P -u. .. ,zt) = E [f p (Yf, Y^,..., Ff' v ) ^-i(F^, f; ;A, )| F£ 

= J' fp(z,Zp- 1 ,.. .,Zl)PN-p(.Z,Zp-l)dz. 

Applying the induction hypothesis for f p \ — g, we obtain 

E \f p (xf, Y y * v ..., Ff vr] =E[g(F^,..., Ff V ) J/fi ] 

r w 

= I ^CVai-p+ 1> ■ • • >FlV-l) ]~[ Pi-lCyi-l»FiMFi-l 
J i=N-p +2 

f * 

= I f P (yN-p,---,yN-i) Y [ p/-i(y,-i,y,'Kv,-i. □ 

^ i=N-p+l 

We now consider an extended integer sequence 

0 < mi < • • • < nik — n* = no < n\ < ■ ■ ■ < «/ = A, 
and a kernel K e of the form 

K e (u) := e~ d K(u/e ), y e R d , 

with A being integrable on W 1 and [1 d K(u)du - 1. Formally K, converges to the delta 
function <5o on 3 d (in distribution sense) as e i 0. We then have the following stochastic 


representation (involving standard expectations only) for (3.1 1 with n, = m k+i , i = 0,..., / 
r-k+l. 
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Theorem 3.2. Let the chain ( K J/j := ( Y y,N , V >;,v ) fee given by [2.14\, and the modified 
integer sequence (n.) be defined by 

(3.4) 7i, i = 1./. 

7f ffeen feo/ofs 

E \g(X nl[ X m f)\ X mo = x , X N = y\ 


= E [g(X m ,..., X mi _, , X°;\X ni ,...,X „,_,)| X mo = x,X N = y] 


(3.5) 


= lim 

6—>0 


E| 

I* 1 

( Y”>0,x v m a,x Y m 0,X v y,N v v;A , 'i 

{ A >’h , . . •, A mkl , A n . J 

1 k € 


l<] 

E| 

[*el 

(^-^1 

IV 

y;AH 

ni \ 



Proof. The proof is analogous to the corresponding one in Bayer and Schoenmakers [ 2013) . 
As a rough sketch, apply Theorem [XT] to 

/(X„°;;\.. .,X^,y 0 ,yu ■ ■ •,?/-1) := • ■ .,X° n f,y u . ■. ,yi-i)K e (y 0 - X 


conditional on X; 


■■X: 


send e —» 0, and divide the result by 
Po.n(x, y) = limE \K e (¥~ - X n .) Vtt,]. 


Forward-Reverse algor ithm. Given Theorem |3.2| the corresponding forward-reverse Monte 
Carlo estimator for (3.51 suggests itself: Sample i.i.d. copies x 0 ' JC ’ (1) ,..., X°’ Xi<M ^ of the pro¬ 
cess X 0,x and, independently, i.i.d. copies of the 

process (Y y,N , )/ v:;V j. Take for K a second order kernel, take for simplicity M — M, and 
choose a bandwidth e M ~ XI l,rf if d < 4, or ~ M~ 2 ^ 4+d) if d > 4. By next replac¬ 
ing the expectations in the numerator and denominator of 0 by their respective Monte 
Carlo estimates involving double sums, one ends up with an estimator with Root-Mean- 


Square error 0(M l/2 ) in the case d <4 and 0(M 4,,(4+4) ) in the case d > 4 (cf. Bayer and 
lSchoenmakers| | |2013| for details). 


4. The forward-reverse EM algorithm 

Let us now formulate the forward-reverse EM (FREM) algorithm in the setting of the 
missing data problem. Suppose that the parameter 0 e 0 c l ! and that the Markov chain 
X = (X„,n G N) has state space R d . Assuming that the transition densities p n k of X exist 
for n,k G N the full-data log-likelihood then reads 

n N -\ 

(4.1) l c (8, x) = £ log p s ii+l ( Xi , x i+ i), x G IT" +1 . 

i=0 

In the missing data problem only partial observations X no ,... ,X„ N are available for 0 = 
no < n [ < ... < n,v with log-likelihood function / given in <0 that is intractable in 
most cases. Instead, the maximization of / in 9 has to be replaced by a two step iterative 
procedure, the EM algorithm. 

E-step: In the m-th step evaluate the conditional expectation of the complete data 
log-likelihood 

Q(9,6 m , x) := E 9m [l c (6, X 0 ,X„ N )\X no = x„ 0 ,..., X nN = x„ N ], x G R N+1 . 

M-step: Update the parameter by 

9 m +i = arg max Q(6,9 m ,X). 
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Since in many Markov chain models the E-step is intractable in this form, we propose a 
forward-reverse approximation for the expectation of the transition densities evaluated at 
the observations. 


FR E-step: Evaluate 

(4.2) Qm(e,e m ,x ) := E™[i c (e,x)\x m ,i = o 


where E™ denotes a forward-reverse approximation of the conditional expectation 
under the parameter 8 m . 


After this FR E-step is computed the M-step remains unchanged. This FREM algorithm 
gives a random sequence (8„)„>o that under certain conditions given in the next section 
converges to stationary points of the likelihood function. To assure a.s. boundedness of 
this sequence we apply a stabilization technique introduce in Chen et al. 11988). 


The stable FREM algorithm. Let K m c 0 for m e N be a sequence of compact sets such 
that 

(4.3) K m c K m+ i and 0 = (J K m 

me N 

for all m e N. We define the stable FREM algorithm by checking if 8,„ after the m-th 
maximization step lies in K m and reseting the algorithm otherwise. Choose a starting value 
8q e K 0 and let p n for n G N, po 0, count the number of resets. 

stable M-step: 

(4.4) 0 m+1 = arg max Q m (8,0 m ,X) and p n+l = p n , if arg max Q m (0,0 m ,X) e K m , 

(4.5) 0 m+ i = 0 O and p n+l = p n + 1, if argmax Q m (0,0 m ,X) £ K m . 

We will show in the next section that under weak assumption the number of resets p„ stays 
a.s. finite. Our stable FREM algorithm consists now of iteratively repeating the FR E-step 
and the stable M-step. 


5. Almost sure convergence of the FREM algorithm 


In this section we prove almost sure convergence of the stable FREM algorithm under 
the assumption that the complete data likelihood is from a curved exponential family. Our 


proof 

is mainly based on results from Bayer and Schoenmakers [2013], Fort and Moulines 

[2003 

and the classical framework for the EM algorithm introduced in Laird and Rubin 

[1977 

and Lange 119951. 


5.1. Model setting. Suppose that <p : 0 —* R, £/ : 0 —> R 9 and S : R (,v+1 )d —■» R 9 are 
continuous functions. We make the structural assumption that the full data log-likelihood 
is of the form 


(5.1) 1(0,x o, ...,x N ) = (f>{8) + (S(x o,.. .,x N ),if/(d)), 

i.e. I is from a curved exponential family. In order to proof convergence we need the 
following properties to be fulfilled that naturally hold in many popular models. In Section 
[6]we give several practical examples that fall into this setting. 

Assumption 5.1. (1) There exists a continuous function 8 : R 9 —> 0 such that l(8(s), s) = 

sup ee0 1(8 , s) for all s e ~BS N+l>d . 

(2) The incomplete data likelihood L is continuous in 8 , and the level sets {0 e 0| L(8. x) > 
C ) are compact for any C > 0 and all x. 
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(3) The conditional expectation Eg[S(Xo,... ,X„ N )\X ng = x„ 0 ,...,X„ N = x nN \ exists 
for all (x„ 0 ,..., x„ N ) G ™t' v+ 1 and 8 e © and is continuous on 0. 


To simplify our notation we will neglect in the following the dependence of l on s. 
Under these assumption we can separate the E- and M-step. In order to do so we define 


g( 8 ) := E e [S (X 0 ,.. .,X„ N )\X na = x„ 0 ,... , X„ N = x„ v ]. 

An iteration of the EM algorithm can now be written as 8 m+ \ — 8 o g( 8 m ) T( 8 m ). Let us 

denote by T the set of stationary points of the EM algorithm, i.e. 


r = {6» e ©|0 O g(ff) = 8 ). 


It was shown in Theorem 2 in |Wu|JT983[ that if 0 is open, </> and <// are differentiable and 
Assumption|5.1|holds, then 


T = {8 e 0| d e l( 8 ) = 0), 


such that the fixed points of the EM algorithm coincide with the stationary points of I. 
In Wu 1 1983| it was proved that the set F contains all limit points of ( 8 n ) and that (1(8,,)) 
converges to 1(6q) for some 8q g E. In the following theorem we extend these results to the 
FREM algorithm. Let d(x,A) be the distance between a point x and a set A. 

For the convergence of our forward-reverse based EM algorithm, we naturally also need 
to guarantee convergence of the corresponding forward-reverse estimators. This can be 
guaranteed by the following assumption (cf. also | Bayer and Schoenmakers 2013) Section 
4]). 


Assumption 5.2. (1) For any multi-indices a,/3 e Nq with \a\ + |/?| < 2 and any index 

i there are constants C\ = C\(i , a,/?), C 2 = Ci(i, a,P) > 0 such that 

\d a x dP y Pi(x,y)\ < C 1 exp(-C 2 |x-y| 2 ). 

(2) S is twice differentiable in its arguments and both S and its first and second deriva¬ 
tives are polynomially bounded. 


Now we are ready to state as the main result of this section a general convergence 
theorem for the FREM algorithm. 

Theorem 5.3. Let (K n )„ e ir satisfy ( |4. 3 [ > and choose 80 G K,,. Suppose that Assumptions |5 . 1 \ 
and \5.2\ hold and that 1(F) is compact, then the stable FREM random sequence ( 8 „)„>0 has 
the following properties: 

(1) lim„ p n < 00 a.s. and ( 8 n ) is almost surely bounded. 

(2) lim„ d(l( 8 „), 1(F)) = 0 almost surely. 

(3) If also l is s-times differentiable, then lim„ d( 8 n ,F) = 0 almost surely. 


Proof. (1) Set 

/>) := E™[S(X 0 ,..., X„ N )\X no = x no , ...,X nN = x„ N ]. 

With the above notation an iteration of the FREM algorithm can be written as 

e m+ i = e(g FR (e m )). 

It was shown in Lemma 2 in Lavielle and Moulines [ 1999) that the incomplete data log- 
likelihood l is a natural Lyapunov function relative to T and to the set of fixed points T. If 
for any e > 0 and compact K c © we have 




-|/°e(s™(fl m ))-z°m,,)IW>e 


(5.2) 


< 00 a.s., 
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then Proposition 11 in Fort and Moulines 1 2003| implies in our setting that lim sup,, p n < oo 
almost surely and that Q n is a compact sequence such that (1) follows. To obtain ( |5.2[ i it is 
sufficient by Borel-Cantelli to prove that 


2 P (|/° d(g FR (O m )) - l o T(6 m )\h m e K > e) 


< OO. 


Dehne for any 6 > 0 an e-neighborhood of K by 


K s :-(xe 


| inf \z - x\ < 5). 

zeK 


By assumption / and 8 are continuous, such that for any 6 > 0 there exists q > 0 such that 
for any x,y e Kg we have \l o 8(x) - l o 0(y)| < e whenever \x-y\ < q. 

Choosing now e = 6 A q yields 

P(\l o 8(g FR (8 m )) - 1 o T(0 m )\lg m£K >e) = P(\lo ~8(g FR (8 m )) - l o 6{g(6 m ))\\ Bm eK > e) 

= P(\lo 6(g FR (e m )) - l o 8{g(8 m ))\l 9meK > e, | g FR (8 m ) - g(8 m )\h m eK < s) 
+P(\l o ~8(g FR (8 m )) - 1 o 8(g(d m ))\l 9m£K > e, | g FR (8 m ) - g(8 m )\lg m€K > d) 

<2P(\g FR (e m )-g(0 m )\lg m€ K>e) 

Markov’s inequality gives then 

p(\l O 8(g FR (8 m )) - / O m„)[ h m eK >e)< 2e- i E[| g FR (0 m ) - g(8 m )\ k l 8m€K 
for some k > 0. 

By | Bayer and Schoentnakers[ 2013( Theorem 4.18] (see also Remark [574] below), we 
can always choose a number N of samples for the forward-reverse algorithm and a corre¬ 
sponding bandwidth e = — A' " such that 

E | g FR (0m) ~ g(fim)\ 1 9 m eK < ~ 

for some constant C. We note that the choice of a depends on the dimension d as well as 
on the order of the kernel. For instance, for d < 4 and a standard first order accurate kernel 
K , we can choose any 1/4 < a < 1/d. 

In any case, if we choose N > m then 


(5.3) 


Yj P {\ 1 ° - I ° ne m )\lg m€K > e) < oo. 


which proves (1). 

To prove (2) and (3) observe that for every K c 0 we have 

lim \l(8 m+ i) - l o T(8 ln )\lg meK = 0 a.s. 

m 

By Borel-Cantelli it is sufficient to prove that 

2 P(|/(0 m+1 ) - / o T(0 m )\lg m e K > e) 


< oo. 


But since we have shown in (1) that p n is finite a.s., we have in the above sum that 9 m+ 1 = 
6(g FR (8 m )) in almost all summands. Hence, it is sufficient to show that 


Y P(mg FR (8 m ))) - 1 O T(0 m )\lg m€K > e ) 


< oo, 


which is nothing else than Q- The statement of (2) and (3) follows now from Sard’s 
theorem (cf. Brockner 1975]) and Proposition 9 in Fort and Moulines [2003]. □ 
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Remark 5.4. In the above convergence proof we need to rely on the convergence proof of 
the forward-reverse estimator when the bandwidth tends to zero and the number of simu¬ 
lated Monte Carlo samples tends to infinity. Such a proof is carried out for the diffusion 
case in |Bayer and Schoenmakers 2013, Theorem 4.18], where also rates of convergence 
are given. We note that the proof only relies on the transition densities of (a discrete skele¬ 
ton of) the underlying diffusion process. Hence, it immediately carries over to the present 
setting. 


Theorem 5.3 is a general convergence statement that links the limiting points of the 
FREM sequence to the set of stationary points of /. In many concrete models the set T of 
stationary points consists of isolated points only such that an analysis of the Hessian of 
l gives conditions for local maxima. A more detailed discussion in this direction can be 
found in Lavielle and Moulinesj jl~999) for example. 


6. Implementation and complexity of the FREM algorithm 


Before presenting two concrete numerical examples, we will first discuss general as¬ 
pects of the implementation of the forward-reverse EM algorithm. For this purpose, let 
us, for simplicity, assume that the Markov chains X and (Y, J/) are time-homogeneous, i.e., 
that p = and q = qi do not depend on time k. We assume that we observe the Markov 
process X at times 0 = (q < ■ ■ ■ < i r = N , i.e., our data consist of the values X, t — x (i , 
k = 0,..., r. For later use, we introduce the shortcut-notation x := (x, ( )l =0 . 

The law of X depends on an s-dimensional parameter 6 e IT', which we are trying to 
estimate, i.e., p = p H . To this end, let 

N 

x 0 ,..., x N ) := ^ log p e (x i -i,x i ) 

i=i 

denote the log-likelihood function for the estimation problem assuming full observation. 
As before, we make the structural assumption that 

n 

(6.1) t{6\ x 0 ,..., x N ) = <p(6) + ^ S j(x o,..., x N )if/j(ff). 

i= 1 


For simplicity, we further assume that there are functions S j such that 

r 

S ;(x 0 , . . . , X N ) = ^ S \ (X; ,. . . , X;,). 

7=1 


The structural assumption (6.1 1 allows us to effectively evaluate the conditional expecta¬ 
tion of the log-likelihood t c for different parameters Q, without having to re-compute the 
conditional expectations. More precisely, recall that for a given guess 6 the E step of the 
EM algorithm consists in calculating the function 


( 6 . 2 ) 


6^ Q(6-,8,x):=E lj [e c (e-,X 0 ,...,X N ))\Xi j =x ir ; = 0,...,r]. 


with Eg denoting (conditional) expectation under the parameter 9. Inserting the structural 
assumption (6.11, we immediately obtain 

m m 

6(0; e,x) = m + Yj ^) E fl [ 5 '(*0, ..., A,v)l X h = X,., ./ = 0,..., r] = m + YjSiiVzl 
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with z e t := E ? [S;(To, • ■ • > Yv)l A,, = x ip j = 0,..., r], i = 1,..., m. Note that the definition 
of does not depend on the free parameter 6. Thus, only one (expensive) round of calcu¬ 
lations of conditional expectations is needed for a given 6 , producing a cheap-to-evaluate 
function in 6 , which can then be fed into any maximization algorithm. 

For any given 9, the calculation of the numbers z®,..., z e m requires running the forward- 
reverse algorithm for conditional expectations. More precisely, using the Markov property 
we decompose 

A := %[S;(X 0 ,.. - ,X N )\Xi. = x ip 7 = 0.... ,r] 

= f J E~ 0 [sl(.X ij _ 1 ,...,X i .)\x iH =x ij _ l ,X ij = Xij . 
j= i 

All these conditional expectations are of the Markov-bridge type for which the forward- 
reverse algorithm is designed. Hence, for each iteration of the EM algorithm, we apply 
the forward-reverse algorithm r times, one for the time-intervals ij \,... ,ij, j = 1,..., r, 
evaluating all the functionals k' p ..., h J m at one go. 


6.1. Choosing the reverse process. Recall the defining equation for the one-step transi¬ 
tion density q of the reverse process given in \2.\2) . For simplicity, we shall again assume 
that the forward and the reverse processes are time-homogeneous, implying that \2.\2\ can 
be re-expressed as 


q(y,z ) 


p(z,y) 
y, z)' 


Notice that in this equation only p is given a-priori, i.e., the user is free to choose any 
re-normalization <// provided that for any y € If' 1 the resulting function z i-> q(y,z) is non¬ 
negative and integrates to 1. In particular, we can turn the equation around, choose any 
transition density q and define 


fi(y, z) 


p(z,y ) 
q(y, z)' 


Note, however, that for the resulting forward-reverse process square integrability of the 
process J/ is desirable. More precisely, only square integrability of the (numerator of the) 
complete estimator corresponding to ( |3.5| > is required, but it seems far-fetched to hope 
for any cancellations giving square integrable estimators when J/ itself is not square inte- 
grable. From a practical point of view, it therefore seems reasonable to aim for functions 
tjj satisfying 


l 


in the sense that t// is bounded from above by a number slightly smaller than 1 and bounded 
from below by a number slightly smaller than 1. Indeed, note that J/ is obtained by mul¬ 
tiplying terms of the form Y n+ \) along the whole trajectory of the reverse process Y. 
Hence, if i// is bounded by a large constant, J/ could easily take extremely large values, 
to the extent that buffer-overflow might occur in the numerical implementation - think of 
multiplying 100 numbers of order 100. On the other hand, if <// is considerably smaller 
than 1, J/ might take very small values, which can cause problems in particular taking 
into account the division by the forward-reverse estimator for the transition density in the 
denominator of the forward-reverse estimator. 

Heuristically, the following procedure seems promising. 
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If y i—> JT p(z,y)dz can be computed in closed form (or so fast that one can think 
of a closed formula), then choose 


<A00 := My, z) 


-f 


p(z,y)dz. 


Otherwise, assume that we can find a non-negative (measurable) function ])(z,y ) 
with closed form expression for J ?rl p(z, y)dz such that p(z,y) ~ J>(z,y). Then 
define 

, , p(z,y) 
q(y,z) := . , 

J Ttl , p(z,y)dz 

which is a density in z. By construction, we have 


My, z) 


p(z,y) 


f 


~ ( ,, p(z,y) 

p(z,y)dz —— 
ri p(z,y) 


q(y,z ) 

implying that we are (almost) back in the first situation. 

Remark 6.1. Even if we can, indeed, explicitly compute if/(y,z ) = f. r , p(z, y)dz„ there is 
generally no guarantee that J/ has (non-exploding) finite second moments. However, in 
practice, this case seems to be much easier to control and analyze. 

6.2. Complexity of the forward-reverse algorithm. We end this general discussion of 
the forward-reverse EM algorithm by a refined analysis of the complexity of the forward- 


reverse algorithm for conditional expectations as compared to Bayer and Schoenmakers 


]2013|. We start with an auxiliary lemma concerning the maximum product of numbers of 


two species of balls in bins, which is an easy consequence of a result by Gonnet Gonnet 
1 1981| , see also | |Sedgewick and Flajolet| 1996| Section 8.4]. 

Lemma 6.2. Let X be a random variable supported in a compact set D c W 1 with a 
uniformly bounded density p. For any K 6 N construct a partition B .... Bof D in 
measurable sets of equal Lebesgue measure A(Bf) = A(D)/K, i — 1,..., K. Finally, for 
given N € N let X\,..., Xn be a sequence of independent copies ofX and define 

N k :=#{/€ {1,..., N) | Xi&Bf), k=\,...,K. 

For N,K oo such that N = 0(K ) we have the asymptotic relation 


max Nk 

k=l,...,K 


= o 


log N \ 

log log IV ) 


Proof. Let 

Pk -.= P(Xe Bf) < HpIL IK 

and observe that the random vector (N\, Nk) satisfies a multi-nomial distribution with 
parameters K , N and (p i,..., pM). 

The proof for the statement in the special case of p\ = ... - p K = l/K is given 
in Gonnet Gonnet| fl981| , so we only need to argue that the relation extends to the non- 
uniform case. To this end, let K' [K/ |/;||, K J and let (Mi ,.... M K t) denote a multi-nomial 
random variable with parameters N, K' and (1 /K', ...,1/A''). As /V = O(K') we have by 
Gonnet’s result that 


E 


max Mk 

k=\,...,K' 


= 0 


log^V \ 


^log \ogN) 

Moreover, it is clear that Efmax^i,...,^ Nk] < E [max^i.f/ Mk] and we have proved the 

assertion. □ 
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Theorem 6.3. Assume that the transition densities p and q have compact support in 
Moreover, assume that the kernel K is supported in a ball of radius R > 0. Then the 
forward-reverse algorithm for N foni’ard and reverse trajectory based on a bandwidth 
proportional to N _1 i d can be implemented in such a way that its expected cost is 0(N log N) 
as N —> oo. 

Proof In order to increase the clarity of the argument, we re-write the double sum in the 
forward-reverse algorithm to a simpler form, which highlights the computational issues. 
Indeed, we are trying to compute a double sum of the form 

i =1 7=1 

where P LJ obviously depends on the whole /th sample of the forward process X and on the 
whole /th sample of the reverse process 

We may assume that the end points X' n , and K/ of the N samples of the forward and 
reverse trajectories are contained in a compact set [-L, L] d . (Indeed, the necessary re¬ 
scaling operation can obviously be done with O(N) operations.) In fact, for ease of notation 
we shall assume that the points are actually contained in [0, l] d . We sub-divide [0, \] d in 
boxes with side-length S e, where S > R is chosen such that 1 /(S e) e N. Note that there 
are K := ( Sef d boxes which we order lexicographically and associate with the numbers 
1,..., K accordingly. 

In the next step, we shall order the points X' n , and Yl into these boxes. First, let us 
define a function f : [0, l] rf —> {1,..., 1 /(S e)} d by setting 

/i(x):=(rxi/(56)l,...,rx d /(5e)l), 

with ["■] denoting the smallest integer larger or equal than a number. Moreover, define 
f 2 :{l,...,l/(Se)} d ^{l,...,K}by 

fii.ii,. ■ . , id) := if - 1)(S e)~ d+l + (i 2 - 1 )(Sey d+2 + ■ • ■ + {i d - 1) + 1. 

Obviously, a point x e [0, Y\ d is contained in the box number k if and only if fiif\ (x)) = f 
Now we apply a sorting algorithm like quick-sort to both sets of points (xj r ,..., X *) and 
(Yl',Yf) using the ordering relation defined on [0, l]^ x [0, Y] d by 

< y : hifix)) < fi i f ] (>’)). 

Sorting both sets incurs a computational cost of 0(N log N), so that we can now assume 
that the vectors XL and Yl are ordered. 

n fi/ 

Notice that K, (x - v) 4- 0 if and only if and y are situated in neighboring boxes, 
i.e., if \fi(x) — /iCy)loo ^ 1, where we define laj^, := max^i ^/ |(r i j for multi-indices a. 
Moreover, there are 3 d such neighboring boxes, whose indices can be easily identified, in 
the sense that there is a simple set-valued function f which maps an index k to the set of 
all the indices f 2 (k) of the at most 3 d neighboring boxes. Moreover, for any k e {1,At} 
let X" d ) be the first element of the ordered sequence of X\ f lying in the box k. Likewise, 
let Y J } k) be the first element in the ordered sequence lying in the box with index k. 
Note that identifying these 2 K indices 1(1),..., i(K ) and j( 1),.... j(K) can be achieved at 
computational costs of order O(KlogN) = OiN log 7Vj. 

'Obviously, this assumption can be weakened. 

2 

To make this construction fully rigorous, we would have to make the boxes half-open and exclude the 
boundary of [0, l] d . 
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After all these preparations, we can finally express the double sum ( |6.3| ) as 

N N K i(k+ 1)-1 j(r+ 1)-1 

( 6 . 4 ) 

i= 1 j= 1 k= 1 i=i(k) j=j(r) 

Regarding the computational complexity of the right hand side, note that we have the 
deterministic bounds 


K = 0(N ), 

\m)\ < 3 d . 


Moreover, regarding the stochastic contributions, the expected maximum number of sam¬ 
ples X' n , (L, respectively) contained in any of the boxes is bounded by <9(log N) by 
Lemma[tv2l i.e.. 


max (j(r + 1) ■ 


• J(r)) 


0(\ogN), 


which then needs to be multiplied by the total number N of points X' n , to get the complexity 
of the double summation step. □ 


Remark 6.4. As becomes apparent in the proof of Theorem |6. 3 1 the constant in front of the 
asymptotic complexity bound does depend exponentially on the dimension d. 


Remark 6.5. Notice that the box-ordering step can be omitted by maintaining a list of all 
the indices of trajectories whose end-points lie in every single box. The asymptotic rate of 
complexity of the total algorithm does not change by omitting the ordering, though. 


7. Applications of the FREM algorithm 


The forward reverse EM algorithm is a versatile tool for parameter estimation in dy¬ 
namic stochastic models. It can be applied in discrete time Markov models, but also in the 
the setting of discrete observations of time-continuous Markov processes such as diffusions 
for example. 

In this section we give examples from both worlds: we start by a discretized Ornstein- 
Uhlenbeck process that serves as a benchmark model, since the likelihood function can be 
treated analytically. Then we give an example of a partially hidden Markov model that 
is motivated be applications in system biology in Langrock and King 1 2013) . Finally, we 
remark on limitations of the EM algorithm when the log-likelihood is not integrable and 
demonstrate these limitations in the context of a Cox-Ingersoll-Ross process. For a com¬ 
plex real data application of our method we refer the interested reader to the forthcoming 
companion paper Bayer et al. |20l5) . 


7.1. Omstein-Uhlenbeck dynamics. In this section we apply the forward-reverse EM 
algorithm to simulated data from a discretized Ornstein-Uhlenbeck process. The corre¬ 
sponding Markov chain is thus given by 

(7.1) X n+ i = X„ + AX n At + AW n+1 , n > 0 

where W„ are independent random variables distributed according to N( 0, At). The drift 
parameter A e R is unknown and we will employ the forward reverse EM algorithm to 
estimate it from simulated data. The Ornstein-Uhlenbeck model has the advantage that the 
likelihood estimator is available in closed form and we can thus compare it to the results 
of the EM algorithm. 
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At 

N 

bandwidth 

mean A 

std dev /I 

likel. 

std dev likel. 

0.1 

2000 

0.0005 

0.972 

0.0135 

-3.402 

0.00290 


8000 

0.000125 

1.098 

0.00841 

-3.383 

0.00062 


32000 

3.125e-05 

1.132 

0.00476 

-3.381 

0.000123 


128000 

7.812e-06 

1.151 

0.00236 

-3.381 

2.783e-05 


512000 

1.953e-06 

1.157 

0.00117 

-3.381 

4.745e-06 


2048000 

4.882e-07 

1.159 

0.000581 

-3.381 

1.005e-06 

0.05 

2000 

0.0005 

1.160 

0.0141 

-3.107 

0.000854 


8000 

0.000125 

1.247 

0.00872 

-3.103 

9.867e-05 


32000 

3.125e-05 

1.253 

0.00468 

-3.103 

1.329e-05 


128000 

7.812e-06 

1.265 

0.00225 

-3.103 

3.772e-06 


512000 

1.953e-06 

1.265 

0.00111 

-3.103 

6.005e-07 


Table 1. Behavior of the forward-reverse EM algorithm for a discretized 
Ornstein-Uhlenbeck model for different step sizes At, initial guess A = 
0.5 and true MLE Tmle = 1.161 and 1.266, respectively. 


In each simulation run we suppose that we have known observations 

X[),X\oAr, ■ ■ ■ -2f4QA f 

for varying step size At and use the EM methodology to approximate the likelihood func¬ 
tion in between. We perform six iteration of the algorithm with increasing number of data 
points N. 

In table Q] we summarize the results of two runs for the discrete Ornstein-Uhlenbeck 
chain. The mean and standard deviation are estimated from 1000 Monte Carlo iterations. 
We find that already after three steps the mean is very close to the corresponding estimate 
of the true MLE. This indicates a surprisingly fast convergence for this example. Note 
also that the approximated value of the likelihood function stabilizes extremely fast at the 
maximum. 

Table |2]gives results for the same setup as in Table [T]but with initial guess A = 2 such 
that the forward-reverse EM algorithm converges from above to the true maximum of the 
likelihood function. We observe that the smaller step size At = 0.05 results in a more 
accurate approximation of the likelihood and also of the true MLE. It seems that the step 
size has crucial influence on the convergence rate of the algorithm, since for At = 0.05 the 
likelihood stabilizes already from the second iteration. 

In Figure[T]the empirical distribution of 1000 estimates for A is plotted. The initial value 
was 0.5 and the true maximum of the likelihood function is at 1.161. The step size between 
observations was chosen to be At = 0.1. The histogram on the left shows the estimates after 
only one iteration and on the right the estimates were obtained from five iterations of the 
forward-reverse EM algorithm. 

Figure [2] depicts the distribution of 1000 Monte Carlo samples of the likelihood val¬ 
ues that led to the estimates in Figure [T] It is interesting to see that after one iteration of 
the algorithm the likelihood values are approximately bell shaped (left histogram) whereas 
after five iterations the distributions becomes more and more one-sided as would be ex¬ 
pected, since the EM algorithm only increase the likelihood from step to step towards the 
maximum. 

Figure[3]shows the convergence of the forward reverse EM algorithm when the number 
of iterations increases. We find that already after 4 iterations the estimate is very close to 
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At 

N 

bandwidth 

mean A 

std dev A 

likel. 

std dev likel. 

0.1 

2000 

0.0005 

1.554 

0.0353 

-3.457 

0.0134 


8000 

0.000125 

1.312 

0.0127 

-3.393 

0.00221 


32000 

3.125e-05 

1.217 

0.00544 

-3.382 

0.000351 


128000 

7.812e-06 

1.185 

0.00245 

-3.381 

5.817e-05 


512000 

1.953e-06 

1.168 

0.00121 

-3.381 

1.227e-05 

0.05 

2000 

0.0005 

1.390 

0.0248 

-3.108 

0.00238 


8000 

0.000125 

1.289 

0.00925 

-3.103 

0.000130 


32000 

3.125e-05 

1.261 

0.00471 

-3.103 

1.451e-05 


128000 

7.812e-06 

1.266 

0.00221 

-3.103 

2.538e-06 


512000 

1.953e-06 

1.266 

0.00113 

-3.103 

5.855e-07 


Table 2. Behavior of the forward-reverse EM algorithm for a discretized 
Ornstein-Uhlenbeck model for different step sizes At, initial guess A — 2 
and true MLE Tmle = 1.161 and 1.266, respectively. 



0.94 0.96 0.98 1.00 1.02 

Estimated values 



1.158 1.159 1.160 1.161 1.162 


Estimated values 


Figure 1. Empirical distribution of 1000 estimates after one iteration 
(right) and after five iteration (left) of the forward-reverse EM algorithm. 


the true MLE for A. After six iterations the algorithm has almost perfectly stabilized at the 
value of the true MLE A = 1.16. 


7.2. Hidden Markov models. Our forward reverse EM algorithm can also be applied 
in the context of hidden Markov models (HMM). A typical HMM consists of observed 
state process and a hidden (i.e. unobserved) Markov chain that models the internal regime 
switching of the state process (see for example MacDonald and Zucchini 11997 ] for a 
general introduction). When for the hidden process only some values can be observed we 
speak of a partially HMM. 

Recently, partially HMMs have been applied with some success in the modeling of eco¬ 
logical systems to understand the relation between animal populations and environmen¬ 
tal parameters in a dynamical setting. The example considered here is a partially hidden 
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-3.410 -3.405 -3.400 -3.395 



-3.381055 -3.381052 -3.381049 


Likelihood 


Likelihood 


Figure 2. Empirical distribution of the likelihood values of 1000 Monte 
Carlo samples after one iteration (right) and after five iteration (left) of 
the forward-reverse EM algorithm. 


Markov model and is inspired by applications in modeling mark-recapture-recovery data 
as discussed in jLangrock and Kingj 12013) for example. In mark-recapture-recovery stud¬ 
ies every individual of a population is marked by some tag or ring that uniquely identifies 
the individual and some properties of interest are measured (e.g. weight, age etc). In sub¬ 
sequent surveys each individual is either recaptured such that the measurements could be 
taken again or if it can not be recaptured this will be recorded as a missing data point. 
Hence, we obtain a sequence of data with missing values that leads naturally to a partially 
HMM. 

Let us consider a two-dimensional Markov chain X„ = (X^, X 2 ). For the first component 
N observations Xl ,... ,X l N are given, whereas the second component X 2 is only observed 
partially after each k e {1,..., N] time steps, i.e. 


y2 -y2 -y2 

A 0’ A k’ A 2k’ • * 



are known. Suppose that the one-step transition probabilities are normal distributions: 

£(X„ +1 \X„) = N(/(X„),E), 


with mean given by fi e (x) = 6 + x 2 for an unknown parameter 9 e K 2 . The covariance 
matrix 2 e R 2x2 is assumed to be constant. 

In this setup the log-likelihood function based on full observations X is given by 


IMX o, ...,X N ) = 1 - 27rdet(2) 1/2 - \ Y Yf. 

1 i=i 


where 

Yf = (x, - Ax,-i)) t n (x, - Axt-i)) 

with precision matrix O = (a>ij) '■= 2 1 . In coordinates we thus have 
Yf = «n((//,i)“ + (2a> i2)f/;,if//,2 + < i)22Uf 2 , 
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Number of forward-reverse EM steps 

Figure 3. Convergence of the forward-reverse EM algorithm from one 
to six iterations for each 1000 estimates of A. The value of the true MLE 
is if = 1.161. 

where := X', - 8, - (X'j_ l ) 2 . Calculating the score function gives therefore 

Hx wi1 ( 2(x ' )2+ei+2(x '-^ 2 ) + 2wi2 {° 2 - x > + (x /-i )2 ) 

and 

^ = -\ 2 ^22 (2 (Xff +e 2 + 2 (X^) 2 ) + 2a>i2 (e 1 - xj + (X ^) 2 ). 

Since not all values of X 2 are observed, this log-likelihood cannot be maximized directly. 
Instead, the forward reverse algorithm approximates the expected log-likelihood given the 
partial observations. The E-step in this model reads as follows. 

FR E-step. Evaluate the forward-reverse approximation 

QJO, e m ,X l ,...,X N ) = E g FR [l c (6, X 0 ,..., X n )\X^ .... 4; 4, i = 0,..., [N/ki] . 
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This can be rewritten as 


By approximating directly the score function this can be further simplified to 




N 


E™ 2 ^n((Xf ) 2 + (Z/_,) 2 )2 anUXlif - Xf)\X l Q , ...,X l N ; X 2 k , * = 0. [N/k\ 


and 



N 


E™ 2 2^22 ((Xf ) 2 + (4_,) 2 )2 ajndxlt ) 2 - Xj)\X l 0 ,... ,X' N ;X 2 k , i = 0,..., [N/k\ 


Due to the linearity of the score function in 8 the M-step is now straightforward. 


M-step. Solve the linear system 


Q d °'(e,e m ,x x ,...,x N ) = o 
Q d m e 2 (e,e m ,x u ...,x N ) = o 


(7.2) 

(7.3) 


for 6 1 and di to update 8 m+ \ = ( 0 \, 62 ). 

Remark 7.1. The example given here can also be extended to HMMs with more involved 
likelihood structure. In particular, the case of a non linear score function in 6 can easily 
be treated with standard numerical methods such that also in these examples the M-step 
remains feasible (see also |Liu and Rubin]]1994| and |Meng and Rubtn| | |1993) ). 

7.3. A final note of warning by a discrete Cox-Ingersoll-Ross example. Consider the 
Markov chain given by 


X n+1 = X n + A (0 - X„) A t + cr |X„r AW n+u 


(7.4) 


where At is fixed and AVV'„ are independent random variables distributed according to 
A/YO, At). Moreover, we assume that 0 < y is fixed and known. The other parameters 
<x, A and 8 are unknown and need to be estimated. In the case y - 1/2 it corresponds a of 
Euler discretization of the Cox-Ingersoll-Ross model from finance. 

Up to constant terms (in the un-known parameters cr, A and 8 ), the log-likelihood func¬ 
tion of a sequence of observations x = (xq, ..., x N ) of the full path of the process X is given 
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by 


. (cr, A, (9; x) = log 


N 


n p(xt- 

t,*i) 

i=l 



1 v-i (*; - (1 - - A9At) 

= -«‘og^-2^L- 

1= 1 


l*i- 


i 2 y 


-N log cr ■ 


1 


N 

2cr 2 At 2 
1 = 1 


l*i-i 


[2y 


•2(1 - AAt) 


XjXj -1 

l*;-t| 2y 


Xi ^ xf , 

- 2/10Ar-^ + (1 - /lAf) 2 


l*;-il 2y 
+ 2/l0Af(l - /lAf) 


*/-i 

*;-i| 2y 


l*,-i I 2r 
■ A 2 9 2 At 2 


l*;-i| 2y - 


Assume that we have given partial observations X io ,... ,X ir with z'o — ()<■■■< i r - N, 
while the remaining points Xj, j $ {z'q, ..., i r }, are assumed to be unobserved. Define 
random variables Zq := N and 


i= 1 
N 

i=l 

N 


X 2 


N 

- v 

X 


|x,-il 2y ’ 

^2 •' 

Zj 

i=l 

K-i 

l 2y ’ 

Xi-iXt 

7 a • 

N 

- y 

l 


K-il 27 ’ 

Z.4 .■ 

i= 1 

K_i 

| 2 r ’ 

A,-, 

7, * 

N 

- v 

X 2 

i- 

1 

|X;-il 2y ’ 

Z/6 • 

" Zj 

i= 1 

K-t 

| 2y " 


Hence, we have with X = (X {] ,..., X N ) 

1 


. (cr, A, 9; X) = -Z 0 log cr • 


2cr 2 At 


[Zj - 2A9AtZ 2 - 2(1 - TAf)Z 3 


+ A 2 9 2 At 2 Z 4 + 2AGAt(l - AAt)Z 5 (l - AAt) 2 Z 6 ]. 
Then we do the E-step. Given guesses cr", A", 6" for the parameters, let 
(7.5) Zi := E[Z,|X, 0 = x i(l , ...,X ir = x ir ], i= 1,... ,6, 

and observe that 

g(o-, d, 6>; cr", A n , ff'\ Xj 0 ,, x ir ) := [ 4 (cr, T, 9; X)| X, 0 =x io ,..., X ir = x, r ] 

1 


= -ZO log CT - 


2cr 2 At 


[zi - 2A9Atzi - 2(1 - AAt)z3 


+ A (rAt za + 2A9At( 1 - dAf)r 5 + (1 - AAt) zb\- 


Now the trouble is that for y > 1/2 some of the expectations in (7.5 1 , in particular z 4 , may 
fail to exist. As such, this example shows that in certain cases the expectation of the log- 
likelihood statistic in the EM algorithm does not exist and that, as a consequence, the EM 
algorithm can not be applied. We underline that existence of the log-likelihood expectation 
is a premises for the EM algorithm in general and is not related to the particular approach 
presented in this paper. 
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In the case y < 1/2 the expectations in (7.5 i do exist and we may proceed with first 
order conditions for finding the maximum of 


We have that 


(cr, A, 8 ) h-> Q(cr , A, 8 \ a". A", 8 ";xj 0 ,..., x lr ). 

1 


O ^-0 

daQ = + a 2 At 


[zi - 2 A 8 AtZ 2 - 2(1 - /lAf)z 3 
+ A 2 e 2 At 2 z 4 + 2 A 6 At( 1 - AAt)z 5 (l - AAt) 2 z 6 ], 
diQ = - 


1 


d e Q = 


[-26AtZ2 + 2Atzs + 2 A 6 2 At 2 Z 4 
- - 2 AAt)z 5 - 2Af(l - /!Af)z6], 

\- 2 Atz 2 + 2 A 6 At 2 z 4 + 2Ar( 1 - AAlfe], 


2 cr 2 At 

and we so obtain the maximizers given by 

, ZXZ4 ~ 2Z2Z3Z5 + Zizl + zlz6 ~ 21^4^6 

cr = - - — 1 --- 

AtZo (Zj - Z4Z6) 

Z3Z4 ~ Z2Z5 + ’5 - Z 4 Z 6 

At - Z4Z6) 

Z3Z5 - Z2Z6 


A = 


8 = ■ 

Z3Z4 - Z2Z5 + Z5 - Z4Z6 

For the forward-reverse algorithm, we finally need to specify the reverse chain. In this 
case, we propose to take the reverse chain 

(7.6) Y n+ 1 = Y n - A (8 - Y„) At + cr \Y n \ y AW n+i . 

In order to get the dynamics of J/, we need to derive the normalization function i// between 
the one-step transition densities p of the forward and q of the reverse processes. (We sup¬ 
press the indices as we are in a time-homogeneous situation.) For ( |7.4| > together with ( | 7.6 [ > 
the one-step transition densities are normal densities in the forward variables. 


p(x,y) = 
q(y,z ) = 


1 


V 2nAtcr \x\ y 

1 

yl2nAtcr |y| 7 


exp 


exp - 


(y — x — A (8 — x)A ty 
2cr 2 |x| 2y At 
(z - y + A (8 - y)At ) 2 
2cr 2 \y\ 2y At 


Hence, we get 
(7.7) 

<A(y, z) 


p(z,y ) _ 

>’ 

r 1 

C y-z- A (8 - z)A t ) 2 

(z-y + A( 8 - y)At) 2 1) 

q(y,z ) 

z 

^ \ Act 1 At 

Iz| 2y 

lyl 2y \) 
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